
The qsub program is used to submit jobs to PBS, which then chooses when to run those jobs on the cluster nodes. Simple use of qsub is covered on a number of pages in the HPC Compilation and Job Submission Tutorial. This page assumes you have read through that tutorial and are looking for more details on qsub's use.

There are many options (also called switches or flags) that you can pass to qsub, and two completely equivalent ways of passing them. Throughout the tutorial, we have been putting options in a qsub script such as this one:

#!/bin/bash
: The above line tells Linux to use the shell /bin/bash to execute
: this script.  That must be the first line in the script.

: You must have no lines beginning with # before these
: PBS lines other than the /bin/bash line
#PBS -N 'hello_parallel'
#PBS -o 'qsub.out'
#PBS -e 'qsub.err'
#PBS -W umask=007
#PBS -q low_priority
#PBS -l nodes=5:ppn=4
#PBS -m bea

~/my_program

Then we ran qsub like this:

qsub that_above_script.qsub
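When you run qsub, it prints the job number that PBS assigned to your job, typically in the form JOB_NUMBER.SERVER_NAME (the exact server name depends on how the cluster is configured). If you want to check on the job afterwards, you can list your own jobs with qstat; username below is a placeholder for your login name:

qstat -u username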

You can also do the exact same thing without a script by passing those options to the qsub program directly and sending the commands you want to run to qsub's input stream:

echo ~/my_program | qsub -N hello_parallel -o qsub.out \
   -e qsub.err -W umask=007 -q low_priority \
   -l nodes=5:ppn=4 -m bea

(The trailing backslashes simply continue the command onto the next line; you could also type the whole command on one line without them.) The -N hello_parallel, -m bea and other similar forms are referred to as options. You can get detailed, but cryptic, descriptions of those options in the qsub manual page, which is accessed by this command:

man qsub

Below I discuss some of the simpler options and the important issues related to them; an example script that combines several of them follows the list.

-N 'hello_parallel'
    This sets the name of the job, i.e. the name that shows up in the "Name" column of qstat's output. The name has no significance to the scheduler; it exists simply for your convenience.

-o qsub.out and -e qsub.err
    These tell qsub where to send your job's output stream and error stream, respectively. If you do not specify -o, your output stream is written to JOB_NAME.oJOB_NUMBER, where JOB_NAME is the name specified by -N and JOB_NUMBER is the job number assigned by PBS (this number is printed out when you run qsub). Similarly, the error stream is written to JOB_NAME.eJOB_NUMBER if you do not specify -e. If you do not want one of these streams, set its file name to /dev/null.

-l nodes=N:ppn=P
    This option does not mean what you might think it means; it does not request N machines with P processors each. Instead, it requests N groups of P processor cores, where each group of P cores is on the same machine. Thus nodes=5:ppn=2 might give you four processor cores on each of two machines and two on a third machine, or four on one machine and two on each of three machines. This is a complex issue and is explained in more detail below.

-m bea
    If this were implemented on HPC, PBS would email you when your job reaches certain states: b means "email me when my job starts running," e means "email me when my job exits normally," and a means "email me if my job aborts." Unfortunately, this functionality does not currently work on HPC.

-l walltime=HH:MM:SS
    This sets the maximum amount of time PBS will allow your job to run before it is automatically killed. Replace HH:MM:SS with the maximum time you will let your job run, in hours, minutes and seconds. Generally, the shorter the time you request, the sooner your job will run. At the time this page was written, PBS gave your job four hours by default if you did not specify a walltime.

-q queue_name
    This sets the queue in which your job will run. Currently, the only queues are testing, low_priority and high_priority.
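To see how several of these options fit together, here is a minimal sketch of a qsub script that runs a small serial program for up to half an hour on the low_priority queue. The program path ~/my_program, the job name and the output file names are placeholders; replace them with your own.

#!/bin/bash
: As before, no lines beginning with # may appear above the
: PBS lines other than the /bin/bash line.
#PBS -N short_serial_job
#PBS -o serial.out
#PBS -e serial.err
#PBS -q low_priority
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:30:00

~/my_program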

The nodes=5:ppn=4 option is misleading: it does not literally request five machines with four processor cores each. Instead, it requests five groups of four processor cores, where all four cores in each group are on the same machine. That is a subtle but important difference. If you had typed simply nodes=5 (which is equivalent to nodes=5:ppn=1), you would not be given one processor core on each of five different machines; you would get five processor cores somewhere on the cluster. Similarly, nodes=5:ppn=2 would give you five pairs of processor cores somewhere on the cluster (but both cores in each pair would be on the same machine as one another).

PBS is free to allocate those sets of processors wherever it wants, and so nodes=5:ppn=2 might give you four processor cores on each of two machines and two on a third machine, or it might give you two processor cores on each of five machines, or perhaps four on one machine and two on each of three other machines. In all of those cases, PBS has done exactly what you told it to: it gave you five pairs of processor cores, where both cores in each pair are on the same machine.

Since the machines on the cluster have four processor cores each, you can ensure that you get five machines all to yourself by specifying nodes=5:ppn=4. That requests five groups of four processor cores, where all four processor cores in each group are on the same machine. Since all of our machines have exactly four processor cores, this will give you five separate machines and ensure that nobody else's jobs are running on those machines.

Carefully consider what to choose for your nodes and ppn options. If you only need one processor core and a gigabyte or two of memory (such as for small serial jobs), you should be polite to other users and use nodes=1:ppn=1. If you are running a serial job that needs a lot of memory, you should use nodes=1:ppn=4 to ensure that no other user's job uses up all of the machine's memory. Parallel programs that span multiple machines should use ppn=4 so that no other user's job shares a node's InfiniBand card. The example job above only runs for a few seconds, so it is fine to take all four processor cores on each of five nodes.
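For illustration, here is a sketch of how those three situations look on the qsub command line; the .qsub file names are placeholders for your own scripts:

qsub -l nodes=1:ppn=1 small_serial_job.qsub
qsub -l nodes=1:ppn=4 memory_hungry_serial_job.qsub
qsub -l nodes=5:ppn=4 parallel_job.qsub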

Why Does My Job Randomly Quit after Four Hours?

Our scheduler (PBS) gives jobs a maximum wall clock time – the maximum amount of time that they can run before PBS automatically kills them. On HPC, jobs in the low_priority queue have a default maximum wall clock time of four hours. This limit exists to reduce the harm done by "runaway jobs" – jobs that were misconfigured or used broken software and would otherwise keep running until their wall clock time ran out. You can increase the maximum wall clock time for your job using the -l walltime=HH:MM:SS option in your qsub script (see the chart above).
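For example, if your job genuinely needs up to twelve hours, you could add a line like the following to your qsub script; the twelve-hour value is only an illustration, so request whatever maximum your job actually needs (keeping in mind the per-queue limits described below):

#PBS -l walltime=12:00:00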

Making Your Job Run Sooner

You may often notice that your job sits around in the queue in the "Q" state for long periods of time before it runs. While this can be due to errors on HPC, it is usually due to one or both of two other factors.

First off, if the cluster does not have enough nodes available to run your job, then it will not run your job. If somebody submits a job that uses all of the cluster nodes for twelve hours, nobody else can run any jobs until that large job finishes. If you are trying to run a sixteen-node job, and there are five four-node jobs running, taking 20 of the cluster's 33 nodes, then your job cannot run until one of those four-node jobs finishes.

The other reason relates to what happens when there are enough nodes available to run your job. When enough jobs finish so that there are sufficient cluster nodes available to run your job, the cluster needs to decide whether to run your job or somebody else's job. That decision is made by the PBS scheduler and it is based on several factors:

  1. The number of nodes your job uses. A job that takes up the entire cluster will not run very soon. Use the -l nodes=N:ppn=P option described in the table above to set the number of nodes and cores your job uses.
  2. The maximum length of time that your job claims it will take to run. If you do not explicitly tell qsub how long your job will run, it assumes your job will run for four hours. PBS prefers shorter-lived jobs over longer-lived ones, so it will tend to run your job sooner if you request a smaller amount of time. Use the -l walltime=HH:MM:SS option from the table above to specify how long your job will run; note that PBS will automatically kill your job if it runs longer than the time you specify. A combined example follows this list.
  3. The job priority. This is dependent on which queue you use. If you use the high_priority queue, your job will probably run before jobs in the low_priority queue. Use the -q option to set the queue that your job uses. Note that this option is generally only available to researchers that have provided financial support for the cluster, plus those researchers' graduate students and associates.
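Putting these factors together, a small job with a short requested walltime will usually start sooner than a large, long one. Here is a sketch of such a submission; ~/my_program is a placeholder for your own program and the half-hour walltime is only an illustration:

echo ~/my_program | qsub -q low_priority -l nodes=1:ppn=1,walltime=00:30:00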

Queues on HPC

There are three different queues to which you can submit your jobs. One queue is the testing queue, which is intended for short-lived test jobs for debugging. Please submit your jobs to the testing queue until you are sure that your code is working. Once it is working, you can use the low_priority or high_priority queues to run your job on more machines for longer periods of time. Here are the differences between the three queues:

testing
    Short-lived test jobs that take up no more than one machine. Please test new programs on this queue before submitting them to the other queues. There is a node (or nodes) dedicated to test jobs, so low_priority and high_priority jobs will not get in your way here, and your test jobs will not get in the way of other people's non-test jobs. (An example submission to the testing queue follows this list.)

low_priority
    Jobs that use up to thirty-two nodes for up to twenty-three hours. The queuing system gives high_priority jobs precedence over low_priority jobs, so your job may have to wait longer before it is granted access to cluster nodes.

high_priority
    Jobs that use up to thirty-two nodes for an unlimited amount of time. As the note above mentions, high_priority jobs will get onto cluster nodes before any queued low_priority jobs.
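For example, while you are still debugging you might submit to the testing queue, and only switch the -q option to low_priority or high_priority once the program works; my_test_script.qsub is a placeholder for your own script:

qsub -q testing my_test_script.qsub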